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Abstract 

We describe and examine an algorithm for tomographic image recon¬ 
struction where prior knowledge about the solution is available in the 
form of training images. We first construct a nonnegative dictionary 
based on prototype elements from the training images; this problem is 
formulated as a regularized non-negative matrix factorization. Incorpo¬ 
rating the dictionary as a prior in a convex reconstruction problem, we 
then find an approximate solution with a sparse representation in the 
dictionary. The dictionary is applied to non-overlapping patches of the 
image, which reduces the computational complexity compared to other 
algorithms. Computational experiments clarify the choice and interplay 
of the model parameters and the regularization parameters, and we show 
that in few-projection low-dose settings our algorithm is competitive with 
total variation regularization and tends to include more texture and more 
correct edges. 


1 Introduction 

Computed tomography (CT) is a technique to compute an image of the interior 
of an object from measurements obtained by sending X-rays through the object 
and recording the damping of each ray. CT is used routinely in medical imaging, 
materials science, nondestructive testing and many other applications. 
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CT is an inverse problem [30] and it is challenging to obtain sharp and 
reliable reconstructions in low-dose measurements where we face underdeter¬ 
mined systems of equations, because we must limit the accumulated amount 
of X-rays for health reasons or because measurement time is limited. In these 
circumstances the classic methods of CT, such as filtered back projection [20] 
and algebraic reconstruction techniques [14], are often incapable of producing 
satisfactory reconstructions because they fail to incorporate adequate prior in¬ 
formation [3]. To overcome these difficulties it is necessary to incorporate a 
prior about the solution that can compensate for the lack of data. 

A popular prior is that the image is piecewise constant, leading to total 
variation (TV) regularization schemes [21], [37]. These methods can be very 
powerful when the solution is approximately composed of homogeneous regions 
separated by sharp boundaries. 

An completely different approach is to use prior information in the form of 
“training images” that characterize the geometrical or visual features of interest, 
e.g., from high-accuracy reconstructions (the typical case) or from pictures of 
specimen slices. The goal of this work is to elaborate on this approach. In par¬ 
ticular we consider the two-stage framework where the most important features 
of the training data are first extracted and then integrated in the reconstruction 
problem. 

A natural way to extract and represent prior information from training im¬ 
ages is to form a dictionary that sparsely encodes the information [31]. Learning 
the dictionary from given training data appears to be very suited for incorpo¬ 
rating priors that are otherwise difficult to formulate in a closed form, such 
as image texture. Dictionary learning — combined with sparse representation 
[5, 9, 36] — is now used in many image processing areas including denoising [7], 
[24], inpainting [27], and deblurring [25]. Elad and Ahron [10] address the im¬ 
age denoising problem using a process that combines dictionary learning and 
reconstruction. They use a dictionary trained from a noise-free image using the 
K-SVD algorithm [1] combined with an adaptive dictionary trained on patches 
of the noisy image. 

The use of dictionary learning in tomographic imaging has also emerged re¬ 
cently, e.g., in X-ray CT [12, 38, 40], magnetic resonance imaging [16, 32], elec¬ 
tron tomography [26], positron emission tomography [8], and phase-contrast 
tomography [29]. Two different approaches have emerged — either one con¬ 
structs the dictionary from the given data in a joint learning-reconstruction 
algorithm [8, 16, 26, 32], or one constructs the dictionary from training images 
in a separate step before the reconstruction [12, 29, 38, 40]. Most of these works 
use K-SVD to learn the dictionary (except [12] that uses an “online dictionary 
learning method” [28]), and all the methods regularize the reconstruction by 
means of a penalty that is applied to a patch around every pixel in the image. 
In other words, all patches in the reconstruction should be close to the subspace 
spanned by the dictionary images. While all these methods perform better than 
classical reconstruction methods, they show no significant improvement over the 
TV-regularized approach. 

In simultaneous learning and reconstruction, where the dictionary is learned 
from the given data, the prior is purely data-driven. Hence, one can argue that it 
violates a fundamental principle of inverse problems where a data-independent 
prior is incorporated to eliminate unreasonable models that fit the data. Eor 
this reason we prefer to separate the two steps (which requires that reliable 
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training images are available). We describe and examine a two-stage framework 
where we first construct a dictionary that contains prototype elements from 
these images, and then we use the dictionary as a prior to regularize the recon¬ 
struction problem via computing a solution that has a sparse representation in 
the dictionary. 

Our two-stage algorithm is inspired by the work in [12] and, to some extent, 
[38]. The algorithm in [12] is tested on a simple tomography setup with no noise 
in the data and in [38] the dictionary is trained from an image reconstructed 
by a high-dose X-ray exposure and then used to reconstruct the same image 
with fewer X-ray projections. We utilize the dictionary in a different way using 
blocks of the image (to be discussed later) which reduces the number of un¬ 
knowns. We seek to use more realistic simulations with noisy data, we avoid 
committing “inverse crime,” and we perform a careful study of the sensitivity of 
the reconstruction to the different parameters in the reconstruction model and 
in the algorithm. Finally we compare our algorithm with both classical methods 
and with TV. We are not aware of comprehensive studies of the influence of the 
learned dictionary structure and dictionary parameters in CT. 

Our paper is organized as follows. In section 2 we briefly discuss dictionary 
learning methods and present a framework for solving the image reconstruction 
problem using dictionaries, and in Section 3 we describe the implementation 
details of algorithm. Section 4 presents careful numerical experiments where we 
study the influence of the algorithm and design parameters. Section 5 summa¬ 
rizes our work. We use the following notation, where A is an arbitrary matrix: 




maxij \Aij\. 


sum 


max 


2 The Reconstruction Framework 

X-ray CT is based on the principle that if we send X-rays through an object 
and measure the damping of each ray then, with infinitely many rays, we can 
perfectly reconstruct the object. The attenuation of an X-ray is proportional 
to the object’s attenuation coefficient, as described by Lambert-Beer’s law [6, 
§2.3.1]. We divide the domain onto pixels whose unknown nonnegative attenu¬ 
ation coefficients are organized in the vector x G Similarly we organize the 
measured damping of the rays into the vector b G Then we obtain a linear 
system of equations Ax = h with a large sparse system matrix governed solely 
by the geometry of the measurements: element aij is the length of the ith ray 
passing through pixel j, and the matrix is sparse because each ray only hits a 
small number of pixels [30]. 

The matrix A is ill-conditioned, and often rank deficient, due to the ill- 
posedness of the underlying inverse problem and therefore the solution is very 
sensitive to noise in the data b. For this reason, a simple least squares approach 
with nonnegativity constraints fails to produce a meaningful solution, and we 
must use regularization to incorporate prior information about the solution [13]. 

This work is concerned with underdetermined problems where m < n, and 
the need for regularization is even more pronounced. Classical reconstruction 
methods such as filtered back projection and algebraic iterative methods are not 
suited for these problems because they fail to incorporate enough prior infor¬ 
mation. TV regularization, which is suited for edge-preserving reconstructions. 
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takes the form 

min ^ ||Ax — 6 II 2 + Atv subject to a; > 0 , ( 1 ) 

l<i<n 

where we have included a nonnegativity constraint; Df^x is a finite-difference 
approximation of the gradient at pixel i, and Atv > 0 is a regularization param¬ 
eter. TV methods produce images whose pixel values are clustered into regions 
with somewhat constant intensity [34] , with the result that textural images tend 
to be over-smoothed (except for the sharp edges). Another drawback is that the 
TV problem ( 1 ) tends to produce reconstructions whose intensities are incorrect 

[34]. 

Our goal is to incorporate prior information — e.g., about texture — from a 
set of training images. We focus on formulating and finding a learned dictionary 
W from the training images and solving the tomography problem such that 
X = Wa is di sparse linear combination of the dictionary elements (the columns 
of W). We build on ideas from sparse approximation [5, 9, 36] which seeks an 
approximate representation of a signal/image using a linear combination of a 
few known basis elements. 

As mentioned in the Introduction, some works use a joint formulation that 
combines the dictionary learning problem and the reconstruction problem into 
one optimization problem, i.e., the dictionary is learned from the given noisy 
data. This corresponds to a “bootstrap” situation where one creates the prior 
as part of the solution process. Our work is different: we use a prior that is 
already available in the form of a set of training images, and we use this prior to 
regularize the reconstruction problem. To do this, we use a two-stage algorithm 
where we first compute the dictionary from the given training images, and then 
we use the dictionary to compute the reconstruction. 

The dictionary W should comprise all the important features of the desired 
solution. A learned dictionary — while computationally more expensive than a 
fixed dictionary — has the advantage that it is tailored to the characteristics of 
the desired solution and optimized for the training images. Dictionary learning 
is a way to summarize and represent a large number of training images into 
fewer elements and, at the same time, compensate for noise or other errors in 
these images. The learned dictionary should be robust to irrelevant features, 
and the number of training images should be large enough to ensure that all 
image features are represented; hence dictionaries are typically overcomplete. 

Using training images of the same size as the image to be reconstructed would 
require a huge number of training images and lead to an enormous dictionary. 
All algorithms therefore use a patch dictionary D learned from patches of the 
training images. But contrary to previous algorithms that apply a dictionary- 
based regularization based on overlapping patches around every pixel in the 
image, we divide the reconstruction into nonoverlapping blocks of the same size 
as the patches and use the dictionary D within each block (ensuring that we limit 
blocking effects); conceptually this corresponds to building a global dictionary 
W from D. 

Let the patches be of size P x and let the matrix Y G consist 

of t training image patches arranged as vectors of length p = PQ. Then the 
dictionary learning problem can be viewed as the problem of approximating the 
training matrix as a product of two matrices, Y ^ DH^ where D G is 
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the dictionary of s dictionary image patches (the columns of D)^ and H G 
contains information about the approximation of each of the training image 
patches. Such a decomposition is clearly not unique, so we must incorporate 
further requirements to “shape” the patch dictionary D and the representation 
matrix H. 

Imposing norm and/or non-negativity constraints on the elements of D and 
H or imposing sparsity constraint on matrix H are widely used in unsupervised 
learning. We take the same approach, and thus our generic dictionary learning 
problem takes the form: 

min ^dic {Y,DH) + ^aic{D) + {H). (2) 

D 

Here, the misfit of the factorization approximation is measured by the loss 
function ^dic? while the priors on the patch dictionary D and the representation 
matrix H are taken into account by the regularization functions ^dic and ^rep- 

The dictionary learning problem (2) is a non-convex optimization problem. If 
we choose the functions ^dic? ^dic and ^rep to be convex, then the optimization 
problem in (2) is not jointly convex in but it is convex with respect to 

each variable D or H when the other is fixed. A natural way to find a local 
minimum is therefore to use an alternating approach, first minimizing over H 
with D fixed, and then minimizing over D with H fixed. 

Various dictionary learning methods proposed in the literature share the 
same overall structure but they consider different priors when formulating the 
dictionary learning problem. Examples of such methods include, but are not 
limited to, non-negative matrix factorization [22], the method of optimal di¬ 
rections [11], K-means clustering [18] and its generalization K-SVD [10], and 
the online dictionary learning method [28]. The methods in [19] and [23] are 
designed for training data corrupted by additive noise; but this it is not relevant 
for our work. 

Having computed the patch dictionary D and formed the corresponding 
global dictionary W, the second step is to solve the reconstruction problem. 
Using ideas from sparse approximation, we compute a solution x = Wa where 
a solves the problem 

min ^rec{AWa, b) + ^sp(q^) + ^ip{W a), (3) 

a 

in which the data fidelity is measured by the loss function ^rec and regular¬ 
ization is imposed via penalty functions. Specifically, the function enforces 
the Sparsity Prior on a, often formulated in terms of a sparsity inducing norm, 
while the function ^ip enforces the Image Prior. If we choose the three func¬ 
tions ^rec 5 ^sp and ^ip to be convex, then the problem formulation (3) can be 
solved by means of convex optimization methods. Given a solution to (3) we 
compute the solution as x'^ = Wa^. In Section 4 we illustrate with numerical 
examples that the sparsity penalties in (2) and (3) tend to have a regularizing 
effect on the reconstruction. 


3 Details of Formulation and Implementation 

Recall that the proposed framework for dictionary-based tomographic recon¬ 
struction consists of two conceptual steps: (i) computing a dictionary (using 
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techniques from machine learning), and (ii) computing a reconstruction com¬ 
posed of images from the dictionary. In this section we describe one of many 
ways to efficiently implement such a scheme. We pose the diet ionary-learning 
problem as a so-called non-negative sparse coding problem, and we use least 
squares optimization with non-negative variables and 1-norm regularization to 
compute a reconstruction. 

3.1 The Dictionary Learning Problem 

Dictionary learning problems of the form (2) are generally non-convex optimiza¬ 
tion problems due to the bilinear term DH where both D and H are unknown. 
Applying a convergent iterative optimization method therefore does not guar¬ 
antee that we find a global minimum (only a local stationary point). To obtain 
a good dictionary, we must be careful when choosing the loss function ^dic and 
the penalties ^dic and ^rep on D and and we must also pay attention to 
implementation issues such as the starting point; see the Appendix for details. 

A non-negative matrix factorization (NMF) has the ability to extract mean¬ 
ingful factors [22], and with non-negative elements in D its columns represent 
a basis of images. Similarly, having non-negative elements in H corresponds to 
each training image being represented as a conic combination of dictionary im¬ 
ages, and the representation itself is therefore non-negative. NMF often works 
well in combination with sparsity heuristics [15] which in our application trans¬ 
lates to training image patches being represented as a conic combination of a 
small number of dictionary elements (basis images). 

The dictionary learning problem that we will use henceforth takes the form 
of non-negative sparse coding [15] of a non-negative data matrix Y : 

min i||F-Dff||| + A||ff|| 3 um s.t. D G P, F G (4) 

D 

where the set V is compact and convex and A > 0 is a regularization parameter 
that controls the sparsity-inducing penalty ||i^ ||sum- This problem is an instance 
of the more general formulation (2) if we define 

^dic(Y,DH) = ^\\Y-DH\\l 

and 

^dic(^) = Iv{D), ^rep{H) = 4;x‘(^) + A||ff||sum , 

where Iz denotes the indicator function of a set Z. Note that the loss function 
Zfdic is invariant under a scaling D (D and H for^ >0, and letting 

( ^ oc implies that $rep(C~^^) ^ 0 IICL^II ^ oo if D is nonzero. This 
means that V must be compact to ensure that the problem has well-defined 
minima. Here we will consider two different definitions of the set P, namely 

V^ = {DeRl^^\\\dj\\oo<l} and V 2 = {D \\\dj\\2 < Vp}^ 

The set V^o corresponds to box constraints, and P 2 is a spherical sector of the 
2 -norm ball with radius y^. As we will see in the next section, the use of Poo 
as a prior gives rise to binary-looking images (corresponding to the vertices of 
Poo) whereas P 2 gives rise to more “natural looking” images. 

We emphasize an important difference between the classical K-SVD method 
and our method. While K-SVD requires that we explicitly set the sparsity level. 


6 



SoltaniEtAl 


Tomographic Image Reconstruction using Training Images 


in our approach we affect sparsity implicitly through 1-norm regularization and 
via the regularization parameter A. 

We use the Alternating Direction Method of Multipliers (ADMM) [4] to 
compute an approximate local minimizer of (4). Learning the dictionary with 
the ADMM method has the advantages that the updates are cheap to compute, 
making the method suited for large-scale problems. The implementation details 
are described in the Appendix. 

3.2 The Reconstruction Problem 

Recall that we formulate the CT problem as Ax ^ 6, where b contains the noisy 
data and A is the system matrix. The vector x represents an M x A" image 
of absorption coefficients, and these coefficients must be nonnegative to have 
physical meaning. Hence we must impose a nonnegativity constraint on the 
solution. 

Let us turn to the reconstruction problem based on the patch dictionary D 
and the formulation (3). For ease of our presentation we assume that the image 
size M X A is a multiple of the patch size P x Q, and we partition the image 
into an (M/P) x (N/Q) array of non-overlapping blocks or patches represented 
by the vectors Xj G for j = {M/P){N/Q). The advantage of 

using non-overlapping blocks, compared to overlapping blocks, is that we avoid 
over-smoothing the image textures when averaging over the overlapping regions, 
and it requires less computing time. 

Each block of x is expressed as a conic combination of dictionary images, 
and hence the dictionary prior is expressed as 


Ux = Wa, W = {I 0 P), > 0, 


(5) 


where H is a permutation matrix that re-orders the vector x such that we 
reconstruct the image block by block, W is the global dictionary, and 



is a vector of coefficients for each of a total of q blocks. With this non-overlapping 
formulation, it is straightforward to determine the numebr of unknowns in the 
problem (8). The length of a is sq = n s/p which is equal to the product of the 
over-representation factor s/p and the number of pixels n in the image. 

In pursuit of a nonnegative image x, we impose the constraint that the 
vector a should be nonnegative. This implies that each block Xj of x lies inside 
a polyhedral cone 


C = {Dz\ze C 


(6) 


as illustrated in Figure 1. Clearly, if the dictionary contains the standard basis of 
MP then C is equivalent to the entire nonnegative orthant in However, if the 
cone C is a proper subset of then not all nonnegative images have an exact 
representation in C, and hence the constraints Xj G C may have a regularizing 
effect even without a sparsity prior on a. This can also be motivated by the 
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Figure 1: Polyhedral cone in spanned by five nonnegative dictionary ele¬ 
ments, where denotes the ith canonical unit vector in 


fact that the faces of the cone C consist of images xj that can be represented as 
a conic combination of at most p — 1 dictionary images. 

Adding a sparsity prior on a, in addition to nonnegativity constraints, corre¬ 
sponds to the assumption that Xj can be expressed as a conic combination of a 
small number of dictionary images and hence provides additional regularization. 
We include a 1-norm regularizer in our reconstruction problem as the standard 
approximate sparsity prior on a. 

Reconstruction based on non-overlapping blocks often gives rise to block ar¬ 
tifacts in the reconstruction because the objective in the reconstruction problem 
does not penalize jumps across the boundaries of neighboring blocks. To miti¬ 
gate this type of artifact, we add a penalty term that discourages such jumps. 
We choose a penalty of the form 

i,{z) = l\\Lz\\l/i, £ = M{M/P-1) + N{N/Q-1) (7) 

where L is a matrix such that L ^ is a vector with finite-difference approxima¬ 
tions of the directional derivatives across the block boundaries. The factor £ is 
the total number of pixels along the boundaries of the blocks in the image. 

The constrained least squares reconstruction problem is then given by 

minimizea 0 D)a - b\\l + iJ,^\\a\\i + 5'^ ^ D)a) 

subject to (a > 0 ^ 

with regularization parameters > 0. We normalize the problem formulation 
by i) division of the squared residual norm by the number of measurement m, 
ii) division of the 1-norm of a by the number of blocks and hi) division by 
£ in the function Problem (8) is convex and it is an instance of a sparse 
approximation problem similar to formulations studied in [10]. 

4 Numerical Experiments 

In this section we use numerical examples to demonstrate and quantify the be¬ 
havior of our two-stage algorithm and evaluate the computed reconstructions. 
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In particular we explore the influence of the dictionary structure and its pa¬ 
rameters (number of elements, patch sizes) on the reconstruction, in order to 
illustrate the role of the learned dictionary. 

The underlying idea is to compute a regularized least squares fit in which 
the solution is expressed in terms of the dictionary, and hence it lies in the 
cone C (6) defined by the dictionary elements. Hence there are two types of 
errors in the reconstruction process. Typically, the exact image does not lie 
in the cone C, leading to an approximation error. Moreover, we encounter a 
regularization error due to the combination of the error present in the data and 
the regularization scheme. 

In the learning stage we use a set of images which are similar to the ones we 
wish to reconstruct. The ground-truth or exact image is not contained in 

the training set, so that we avoid committing an inverse crime. All images are 
gray-level and scaled in the interval [0,1]. 

All experiments were run in MATLAB (R2011b) on a 64-bit Linux system. 
The reconstruction problems are solved using the software package TFOCS 
(Templates for First-Order Conic Solvers) [2]. We compare with TV recon¬ 
structions computed by means of the MATLAB software TVReg [17], with 
filtered back projection solutions computed by means of MATLAB’s “iradon” 
function, and solutions computed by means of the algebraic reconstruction tech¬ 
nique (ART, also known as Kaczmarz’s method) with nonnegativety constraints 
implemented in the MATLAB package AIR Tools [14]. (We did not compare 
with Krylov subspace methods because they are inferior to ART for images with 
sharp edges.) 

4.1 The Test Image and the Tomographic Test Problem 

The test images used in Sections 4.2-4.5 are square patches from a high-resolution 
photo of peppers with uneven surfaces resembling texture, making them inter¬ 
esting test images for studies of the reconstruction of textures and structures 
with sharp boundaries. Figure 2 shows the 1600 x 1200 high-resolution im¬ 
age and the exact image of dimensions M x N = 200 x 200. This size allows 
us to perform many numerical experiments in a reasonable amount of time; 
we demonstrate the performance of our algorithm on a larger test problem in 
Section 4.6. 

All test problems represent a parallel-beam tomographic measurement, and 
we use the function paralleltomo from AIR TOOLS [14] to compute the system 
matrix A. The data associated with a set of parallel rays is called a projection, 
and number of rays in each projection is given by W = [v^Vj. If the total 
number of projections is then the number of rows in A is m = while the 

number of columns is n = MN. Recall that we are interested in scenarios with 
a small number of projections. The exact data is generated with the forward 
model after which we add Gaussian white noise, i.e., b = -h e. 

4.2 Studies of the Dictionary Learning Stage 

A good dictionary should preserve the structural information of the training 
images as much as possible and, at the same time, admit a sparse representation 
as well as a small representation error. These requirements are related to the 
number of dictionary elements, i.e., the number of columns s in the matrix 
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Figure 2: Left: the high-resolution image from which we obtain the training 
image patches. Right: the 200 x 200 exact image 


D G Since we want a compressed representation of the training images 

we choose s such that p < s and the precise value will be investigated. 

The optimal patch size P x Q is unclear and will also be studied; without loss 
of generality we assume P = Q. 

The regularization parameter A in (4) balances the matrix factorization error 
and the sparsity constraint on the elements of the matrix H. The larger the 
A, the more weight is given to minimization of ||iL||sum, while for small A more 
weight is given to minimization of the factorization error. If A = 0 then (4) 
reduces to the classical nonnegative matrix factorization problem. 

From the analysis of the upper bound on the regularization parameter A 
in the Appendix, we know X > p implies H = 0; so A can be varied in the 
interval (0,p] to find dictionaries with different sparsity priors. Note that the 
scaling of the training images affects the scaling of the matrix H as well as the 
regularization parameter A. 

To evaluate the impact of the dictionary parameters, we use three different 
patch sizes (5x5, 10 x 10, and 20 x 20) and the number of dictionary elements 
s is chosen to be 2, 3, and 4 times the of the number of rows p in the dictionary 
D. We extract more than 50, 000 overlapping patches from the high-resolution 
image in Figure 2. For different combinations of patch sizes and number of 
dictionary elements we solve the dictionary learning problem (4). 

Figure 3 shows examples of such learned dictionaries, where columns of D 
are represented as images; we see that the penalty constraint D G Poo gives 
rise to “binary looking” dictionary elements while P G P 2 results in dictionary 
elements that use the whole gray-scale range. 

To evaluate the approximation error, i.e., the distance of the exact image 
^exact projection on the cone C (6), we compute the solutions a'j to the q 

approximation problems for all blocks j = 1, 2,..., g in 

min III Pa j 11 2 s.t. > 0. (9) 

Then = Da^ is the best representation/approximation of the jth 
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Figure 3: Examples of dictionary elements. Top row: with the constraint D G 
Vqo the images appear as “binary looking.” Bottom row: with the constraint 
D GV 2 the images appear to use the whole gray-scale range. 


block in the cone. The mean approximation error (MAE) is then computed as 


1 <7 1 

MAE= - V—||Pc(:rf 


‘)--1 


l2' 


( 10 ) 



Figure 4: Mean approximation errors (10) for both D G Poo and P G P2 with 
different patch sizes and different s. 

The ability of the dictionary to represent features and textures from the 
training images, which determines how good reconstructions we are able to 
compute, depends on the regularization parameter A, the patch size, and the 
number of dictionary elements. Figure 4 shows how the mean approximation 
error MAE (10) associated with the dictionary varies with patch size p, number 
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of dictionary elements s, and regularization parameter A. An advantage of larger 
patch sizes is that the variation of MAE with s and A is less pronounced than for 
small patch sizes, so overall we tend to prefer larger patch sizes. In particular, 
for a large patch size we can use a smaller over-representation factor s/p than 
for a small patch size. As A approaches p we have that ||i^||sum approaches 
0, the dictionary D takes arbitrary values, and the approximation errors level 
off at a maximum value. Regarding the two different constraints D G Poo and 
D G V 2 we do not see any big difference in the approximation errors for 10 x 10 
and 20 x 20 patches; to limit the amount of results we now use ^ 2 - 

The computational work depends on the patch size and the number of dic¬ 
tionary elements which, in turn, affects the approximation error: the larger the 
dictionary, the smaller the approximation error, but at a higher computational 
cost. We have found that a good trade-off between the computational work and 
the approximation error can be obtained by increasing the number of dictionary 
elements until the approximation error levels off. 

4.3 Studies of the Reconstruction Stage 

Here we evaluate the overall reconstruction framework including the effect of the 
reconstruction parameters as well as their connection to the dictionary learning 
parameter A and the patch size. 

We solve the reconstruction problem (8) using projection data based on 
the exact image given in Eigure 2. We choose A^p = 25 uniformly distributed 
projection angles in [0°, 180°]. Hence the matrix A has dimensions m = 7,050 
and n = 40, 000, so the problem is highly underdetermined. We use the relative 
noise level ||e|| 2 /||Ax °"^^°^||2 = 0.01. Moreover, we use 5 x 5, 10 x 10 and 20 x 20 
patches and corresponding dictionary matrices and in V 2 of 

size 25 X 100, 100 x 300, and 400 x 800, respectively. Examples of the dictionary 
elements are shown in the bottom row of Eigure 3. 

We first investigate the reconstruction’s sensitivity to the choice of A in the 
dictionary learning problem and the parameters p and S in the reconstruction 
problem. It follows from the optimality conditions of (8) that a* = 0 is optimal 
when p > p = ^\\{I 0 D'^)I[A^b\\oo and hence we choose p G [0,/i]. Large 
values of p refer to the case where the sparsity prior is strong and the solution 
is presented with too few dictionary elements. On the other hand if p is small 
and a sufficient number of dictionary elements are included, the reconstruction 
error worsens only slightly when p decreases. In the next subsection we show 
that we may, indeed, obtain reasonable reconstructions for /i = 0. 

To investigate the effect of regularization parameters A and /r, we first per¬ 
form experiments with (5 = 0 corresponding to no image prior. The quality of a 
solution X is evaluated by the reconstruction error 

RE= (11) 

and Eigure 5 shows contour plots of RE as a function of A and p/q. The 
reconstruction error is smaller for larger patch sizes, and also less dependent 
on the regularization parameter A and the normalized regularization parameter 
p/q. The smallest reconstruction errors are obtained in all dictionary sizes for 
A?^3. 

Let us now consider the reconstructions when J > 0 in order to reduce block 
artifacts. Eigure 6 shows contour plots of the reconstruction error versus p/q 
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|i/q 

(a) 5x5 patches 




0.5 

0.45 

0.4 

0.35 

0.3 

0.25 

0.2 


(c) 20 X 20 patches 

Figure 5: Contour plots of the reconstruction error RE (11) versus A and [ijq. 



|4/q 


(a) 5x5 patches 

Figure 6: Contour plots of the reconstruction errors RE (11) versus {ijq and ^ 
for a fixed A = 3.16. 



and (5, using a fixed A = 3.16. It is no surprise that introducing b acts as a 
regularizer that can significantly reduce blocking artifacts and thus improve the 
reconstruction. Sufficiently large values of S yield smaller reconstruction errors. 
Consistent with the results from Figure 5, the reconstruction errors are smaller 
for 10 X 10 and 20 x 20 patch sizes than for 5x5 patches. For larger patch 
sizes (which allow for capturing more structure in the dictionary elements) the 
reconstruction error is quite insensitive to the choice of S and fi. The contour 
plots in Figure 6 suggest that with our problem specification, we should choose 
1 . 

Finally, in Figure 7 we compare our reconstructions with those computed by 
means of filtered back projection (FBP), the algebraic reconstruction technique 
(ART), and TV regularization. We used the Shepp-Logan filter in “iradon.” 
To be fair, the TV regularization parameter and the number of ART iterations 
were chosen to yield an optimal reconstruction. 

• The FBP reconstruction contains the typical artifacts associated with this 
method for underdetermined problems, such as line structures. 

• The ART reconstruction-although having about the same RE as our re¬ 
construction - is blurry and contains artifacts such as circle structures and 
errors in the corners. 

• The TV reconstruction has the typical “cartoonish” appearance of TV 
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solutions and hence it fails to include most of the details associated with 
the texture; the edges of the pepper grains are distinct but geometrically 
somewhat un-smooth. 

• Our reconstructions, while having about the same RE as the TV recon¬ 
struction, include more texture and some of the details from the exact 
image (but not all) are recovered, especially with Also the pepper 

grain edges resemble more the smooth edges from the exact image. 

We conclude that our dictionary-based reconstruction method appears to have 
an edge over the other three methods. 


(a) FBP, RE = 0.481 


10 iterations 



(b) ART, RE = 0.225 


?i^^=1.833 



(c) TV, RE = 0.214 



|i=34.5, 5=13.3 



(d) 5 X 5, RE = 0.224 


|i=8.62, 5=13.3 



(e) 10 X 10, RE = 0.220 


11=2.15, 5=237 



(f) 20 X 20, RE = 0.226 


Eigure 7: Reconstructions for different patch sizes, with D e V 2 and A = 3.16, 
compared with the EBP, ART and TV solutions. Note that in all our three 
reconstructions /i/g = 0.022. RE denotes the reconstruction error (11). 

Our formulation in (8) enforces that the solution is an exact representation 
in the dictionary, and searching for solutions in the cone spanned by the dictio¬ 
nary elements is a strong assumption in the reconstruction formulation. In [33] 
we investigated this requirement experimentally and showed that relaxing the 
equality Hx = {I 0 D) does not give an advantage, i.e., approximating a solu¬ 
tion by Hx ^ {I ^D)a and minimizing ||nx — {10 D)a \\2 does not improve the 
reconstruction quality, and one can compute a good reconstruction as a conic 
combination of the dictionary elements. 
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4.4 Simplifying the Computational Problem 


We have been working under the assumption that a > 0 and that it is sparse. 
Imposing both non-negativity and a 1-norm constraint on the representation 
vector a are strong assumptions in the reconstruction formulation. If we drop 
the non-negativity constraint in the image reconstruction problem, then ( 8 ) 
takes the form of a constrained least squares problem: 


1 

min - 

a 2 




s.t. ||a||i< 7 , 


( 12 ) 


where 7 > 0. Alternatively we can neglect the parameter /i. This is motivated 
by the plots in Figures 5 and 6 which suggest that for sufficiently large A, S and 
patch sizes, the reconstruction error is almost independent of /i as long as it 
is small. When /i = 0 ( 8 ) reduces to a nonnegatively constrained least square 
problem: 


1 

min - 

a 2 




s.t. Of > 0 . 


(13) 


We use the same test problem with 25 projections and relative noise level 
0.01 as in Section 4.3. We solve problem ( 12 ) for G P 2 , which resulted 

in the smallest reconstruction error when solving ( 8 ) (cf. Figure 7). Likewise 
we choose 10 x 10 and 20 x 20 patch sizes and G T >2 to solve (13). 

Figures 8 and 9 show the respective reconstructions. 


5=1000,7=158.49 



X 7 


Figure 8 : Contour plots of the reconstruction error RE for problem ( 12 ), similar 
to Figures 5 and 6 . Left: RE versus A and 7 when ^ = 0. Middle: RE versus 7 
and S with fixed A = 10. Right: The best reconstruction with RE = 0.243. 

There are two difficulties with the reconstructions computed via (12). The 
lack of a nonnegativity constraint on a can lead to negative pixel values in the 
reconstruction, and this is undesired because it is nonphysical and it leads to a 
larger reconstruction error . Also, as can be seen in Figure 8 , the reconstruction 
is very sensitive to the choose of the regularization parameter 7 ; it must be 
sufficiently large to allow the solution to be represented with a sufficient number 
of dictionary elements, and it should be carefully chosen to provide an acceptable 
reconstruction. 

The solution to problem (13) for a 20 x 20 patch size, compared to the 
solution shown in Figure 7, is not significantly worse both visually and in terms 
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5=13.34 



Figure 9: Plots of reconstruction error versus S for problem (13), using fixed 
A = 3.16 and ju = 0, together with the best reconstructions with RE = 0.242 
and RE = 0.231. Top and bottom correspond to patch sizes 10 x 10 and 20 x 20, 
respectively. 


of reconstruction error. This suggests that using the dictionary obtained from 
(4) with a proper choice of A and patch size and a nonnegatively constraint may 
be sufficient for the reconstruction problem, i.e., we can let ju = 0. While this 
seems to simplify the problem - going from (8) to (13) - it does not significantly 
simplify the computational optimization problem, since the 1-norm constraint 
is handled by simple thresholding in the software; but it help us to get rid of a 
parameter in the reconstruction process. However, when the 1-norm constraint 
is omitted, additional care is necessary when choosing A and the patch sizes to 
avoid introducing artifacts or noise. 

4.5 Studies of Robustness 

To further study the performance of our algorithm, in this section we consider 
reconstructions based on (8) with more noise in the data, and with projections 
within a limited range. The first two tests use 25 and 50 projections with 
uniform angular sampling in [0°, 180°] and with relative noise level = 0.05, i.e., 
a higher noise level than above. For our highly underdetermined problems we 
know that both filtered back projection and algebraic iterative techniques give 
unsatisfactory solutions, and therefore we only compare our method with TV. 
As before the regularization parameters A and /i are chosen from numerical 
experiments such that a solution with the smallest error is obtained. 

The reconstructions are shown in the top and middle rows of Figure 10. The 
reconstruction errors are still similar across the methods. Again, the TV recon- 
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|a=8.62, 5=1000 


5=316.23 


;i^^=16.238 



(d) RE = 0.220 (e) RE = 0.222 (f) TV, RE = 0.215 



Figure 10: The left and middle columns show our reconstructions with A = 3.16 
using and respectively; the right column shows the TV reconstruc¬ 
tions. Top and middle rows: = 25 and = 50 projections in [0°,180°] 

and relative noise level 0.05. Bottom row: = 25 projections in [0°, 120°] and 

relative noise level 0.01. 


structions have the characteristic “cartoonish” appearance while the dictionary- 
based reconstructions retain more the structure and texture but have other ar¬ 
tifacts - especially for Vp = 25. We also note that these artifacts are different 
for the two different dictionaries. 

The third set uses 25 projections uniformly distributed in the limited range 
[0°, 120°] and with relative noise level 0.01. In this case the TV reconstructions 
display additional artifacts related to the limited-angle situation, while such 
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artifacts are somewhat less pronounced in the reconstructions by our algorithm. 

In the numerical studies performed in this paper there is an underlying 
assumption that the scale and orientation of the training images are consistent 
with the unknown image. While this assumption is convenient for the studies 
performed here, it may not be entirely realistic. In a separate work [33] we 
therefore investigated the sensitivity and robustness of the reconstruction to 
variations of the scale and orientation in the training images, and we discuss 
algorithms to estimate the correct relative scale and orientation from the data 
(scale being the more difficult parameter to estimate). 

4.6 A Large Test Case 



Figure 11: Left: high-resolution images of steel micro-structure [41] (top) and 
zirconium grains (bottom) used to generate the training images. Right: the 
corresponding exact images of size 520 x 520. 

We finish the numerical experiments with a verification of our method on two 
larger test problems that simulate the analysis of microstructure in materials 
science. Almost all common metals and many ceramics are poly crystalline, 
i.e., they are composed of many small crystals or grains of varying size and 
orientation, and the variations in orientation can be random. It is of particular 
interest to study how the grain boundaries — the interfaces between grains — 
respond external stimuli such heat, stress or strain. Here we assume that priors 
of the grain structure are available in the form of training images. 

The simulated data was computed using images of steel and zirconium grains. 
The steel microstructure image from [41] is of dimensions 900 x 1280 and the zir¬ 
conium grain image (produced by a scanning electron microscope) is 760 x 1020. 
More than 50, 000 patches are extracted from these images to learn dictionaries 
eV 2 ,V^ of size 400 X 800. To avoid doing inverse crime, we obtain the 
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exact images of dimensions 520 x 520 by first rotating the high-resolution image 
and then extracting the exact image. The high-resolution images and the exact 
images are shown in Figure 11. 


?i=1, 11=67.6, 6=100 
-— T 


;i=2.15, |i=12.02, 6=1000 


> 1 ^^= 23.357 




V t 
4 •' 


• I ! -,i 


v 








\ 


JLA.^ 1 




(a) ^20 e 2^2, RE = 0.095 (b) ^ re = 0.096 (c) TV, RE = 0.099 


>1=1, 11=67.6, 6=100 >1=1,11=12.02, 6=100 >i-p^=11.288 



(d) E>20 ^ ^ 2 , RE = 0.146 (e) ^ ^ q ;l 58 (f) TV, RE = 0.137 


Figure 12: Reconstructions of the 520 x 520 images by our method (left and 
middle) and by the TV method (right). Top: steel microstructure. Bottom: 
zirconium grains. 

We consider a parallel-beam tomographic scenario with = 50 projections 
corresponding to 50 uniformly distributed projections in [0°,180°], leading to 
m = 36, 750 measurements. We add Gaussian white noise with relative noise 
level 0.01 and compute reconstructions by our method as well as the TV method; 
these reconstruction are shown in Figure 12. All regularization parameters 
were chosen to give the best reconstruction as measured by the RF, and we 
note that the reconstruction errors are dominated by the error coming from 
the regularization of the noisy data; the approximation errors \\Pc{x^^^^^) — 
^exact||^^||^exact||^ are of the order 0.03 and 0.05 for the steel and zirconium 
images, respectively. 

As expected, the TV reconstructions exhibit “cartoonish” artifacts, and for 
the steel grains the black interfaces tend to be too thick and they are not so well 
resolved. Our method, for both P 2 and Poo, recovers better the grain interfaces 
that are of interest here. We obtain the sharpest interfaces for Poo but some 
small black “dots” have appeared which are not present for P 2 ; in both cases 
the images are suited for postprocessing via image analysis. 
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5 Conclusions 

We describe and examine an algorithm that incorporates training images as 
priors in computed tomography (CT) reconstruction problems. This type of 
priors can be useful in low-dose CT where we are faced with underdetermined 
systems of equations, and our numerical experiments focus on such problems. 

Our algorithm has two stages. In the first stage we compute a learned 
dictionary from a set of training images using a regularized nonnegative matrix 
factorization (NMF). In the second stage, via a regularized least squares fit we 
compute a nonnegative reconstruction lying in the cone defined by the dictionary 
elements; the reconstruction is sparse with respect to the dictionary. Hence, 
regularization is obtained by enforcing that the reconstruction is within the 
range of the dictionary elements and by the sparsity constraint. 

Our algorithm works with non-overlapping image patches; the same dictio¬ 
nary is used for all patches, and we are able to minimize blocking artifacts by 
an additional regularization term. This reduces the computational complex¬ 
ity, compared to all other proposed algorithms that apply a dictionary-based 
regularization based on overlapping patches around every pixel in the image. 

Our algorithm includes several regularization parameters. In the first stage 
a parameter is used to control the sparsity in the NMF, and in the second 
stage we use one parameter to control the sparsity of the representation in the 
dictionary, and another parameter to avoid blocking artifacts. We perform a 
series of numerical experiments with noisy data and without committing inverse 
crime, where we demonstrate the interplay between these parameters and the 
computed reconstructions, and we show that the reconstructions are not very 
sensitive to these parameters. Further work is needed to develop automatic 
parameter choice algorithms. 

We conclude that training images can be useful as a strong prior for regular¬ 
ization of low-dose CT problems, through a sparse representation in a nonneg¬ 
ative dictionary learned from the training images. Our reconstructions are (not 
surprisingly) superior to those computed by classical methods such as filtered 
back projection and algebraic iterative methods, and they are competitive with 
total variation (TV) reconstructions. Specifically, in our test problems our algo¬ 
rithm tends to be able to include more texture and also produces edges whose 
location is more correct. 
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A The Dictionary Learning Algorithm 

Recall that the dictionary learning problem (4) is non-convex, and hence it is 
too costly to solve it globally. We will therefore optimize locally by applying 
the Alternating Direction Method of Multipliers (ADMM) method [4] to the 
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following reformulation of (4) 

minimizei 5 ,H ^ \\Y - UV\\l + X\\H\U^^ + I^s><,{H) + Iv{D) 
subject to D = H = 

where U G and V G are auxiliary variables that are introduced in 

order to make the ADMM-updates separable and hence cheap. The augmented 
Lagrangian associated with (14) can be expressed as 

Lp{D, H, U, V, A,n) = ^ ||F - C/y III + A ||F||,„^ + (F) + Iv{D) 

+ Tr{A'^{D-U)) + Tr{n'^{H-V)) (15) 

+ ^\\D-U\\l + ^\\H-V\\l 

where A G and U G are Lagrange multipliers, and p is a fixed 

positive penalty parameter which can be chosen prior to the learning process. If 
we partition the variables into two blocks (D, V) and (iL, U) and apply ADMM 
to (14), we obtain an algorithm where each iteration involves the following three 
steps: (i) minimize Lp jointly over D and V; (ii) minimize Lp jointly over H 
and U ; and (hi) update the dual variables A and 71 by taking a gradient-ascent 
step. Since Lp is separable in D and R, step (i) can be expressed as two separate 
updates 


Dk+i = min Lp{D,Hk,Uk,Vk,Ak, Ilk) = Pv{Uk - p ^Ak) (16a) 

14+1 =rmnLp{Dk,Hk,Uk,V,Ak,nk) (16b) 

= (C/J Uk + pI)-\UlY A Uk A pHk) 

where Pv{') is the projection onto the set V. Similarly, Lp is also separable in 
H and 7/, so step (ii) can be written as 

Hk+i= min 7:p(77/,+i, 7L, 7//,, 14 + 1 , A/., 77/^) 

~ (*^A/p(^/c+l ~ P ^k)) 

Lfk+1 = nnn7:p(77/c+i,77/c,7/, R/c+i, A/c,7I/c) 

= (yy^i AAkA pz)fe+i)(i4+iiy+i + pi)-^ 

where S\/p denotes an entrywise soft-thresholding operator, and P^sxt{') is the 
projection onto the non-negative orthant. Finally, the dual variable updates in 
step (iii) are given by 


(16c) 

(16d) 


Tl/c+l — Tin + P{Dk-\-l — Uk+l) (16e) 

TI/c+1 = Pk P p(77/c+i - 14 + 1 ). (16f) 

The projection onto the set is an element-wise projection onto the in¬ 
terval [0,1] and hence easy to compute. However, the projection onto V 2 does 
not have a closed form solution, so we compute it iteratively using Dykstra’s 
alternating projection algorithm. 
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The convergence properties of ADMM when applied to non-convex problems 
of the form (14) have been studied by e.g. [39]. They show that whenever the 
sequence of iterates produced by (16) converges, the limit satisfies the the KKT- 
conditions (i.e., the first-order necessary conditions for optimality) which can 
be expressed as 

D = U, H = V, 

A = -(¥ - DH)H^, n = -D^iY - DH), 

- A G S4>dic (T>), -77 G S4>rep (77) , 

where d denotes the sub differential operator. The convergence result is some¬ 
what weak, but empirical evidence suggests that applying ADMM to non-convex 
problems often works well in practice [4] . It is interesting to note that the point 
77 = [/ = 0 and 77 = R = 0 satisfies the KKT-conditions, and although it is 
a stationary point, it is clearly not a local minima. For this reason, we avoid 
initializing with zeros. We initialize U with some of the images from the train¬ 
ing set, and we set V = [I 0] (i.e., the leading s columns of V is the identity 
matrix). 

The KKT-conditions can be used to formulate stopping criteria. We use the 
following conditions 

11^ -Climax ^ 
max(l, ||L)|| 

max) 

||77-£>^(£>g-y)|Uax ^ 

max(l, ||i7|| 

max) 

where e > 0 is a given tolerance. 

The KKT-conditions can also be used to derive an upper bound A for the 
regularization parameter A. It follows from the optimality conditions that for 
H = Osxt: n = —D^Y and hence for some A and all 77 G 77 we have 

DWg AailOaxtllsum, 

i.e., 77 = 0 satisfies the KKT-conditions for all A > A. Thus, if Y is scaled such 
that all entries in Y are between 0 and 1, then the upper bound X = p can be 
used for both dictionaries since 

sup ||DW|| 

max — max y/p\\Yej\\ 2 <p 

DeV2 

and 

sup ||-D'^y||max = max \\Yej\\i<p 
which implies that D^Y G A(9$rep(0sxt) for all D eV. 


\H-V\l 


< e 


A 


max(l, IliJlImax) 
\\A- {DH -Y)H^ 


max(l, 


0 


< e 


(17a) 

(17b) 
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